Abstract

Sampling errors are inevitable when measuring the ocean; thus, achieving a trustworthy set of observations requires a quality control (QC) procedure capable of detecting spurious data. While manual QC by human experts minimizes errors, it is inefficient for large datasets and vulnerable to inconsistencies between different experts. Although automatic QC circumvents those issues, traditional methods result in high rates of false positives. Here, I propose a novel approach to automatically QC oceanographic data based on the anomaly detection technique. Multiple tests are combined into a single, multidimensional criterion that learns the behavior of the good measurements and identifies bad samples as outliers. When applied to 13 years of hydrographic profiles, anomaly detection resulted in the best classification performance, reducing the error by at least 50%. An open source Python package, CoTeDe, was developed to provide state-of-the-art tools to quality control oceanographic data.
1. Introduction

Conservation of momentum, heat, and mass in the ocean depends on the seawater density (ρ) (see Gill (1982)), thus ρ is a necessary variable to describe and understand processes across the broadest range of scales: from global sea level rise to oil and pollutant dispersion. Because variations of ρ are usually small, while accelerations in the ocean are relatively large, it is impractical for an instrument to make direct in situ density measurements along the water column (Backer Jr., 1981). As an alternative, oceanographers infer ρ from temperature (T), salinity (S), and pressure (P), using a Gibbs function formulation for seawater (Feistel 2008; Millero et al. 2008; Backer Jr. 1981). The {T, S, P} structure is a fundamental building block of physical oceanography, thus errors in describing those variables compromise any resulting conclusions. A persistent bias in T profiles from expendable bathythermographs (XBT) resulted in up to 50% error in the estimates of ocean warming and thermal expansion in recent decades (Cowley et al. 2013; Domingues et al. 2008; Levitus et al. 2009), and profiles from Argo floats lacking correction for pressure sensor drift led to spurious variability in the {T, S} vertical structure (Barker et al. 2011). The robustness of a scientific study is tied to the quality of the data that grounded it.


The marine environment is harsh for electronic sensors, making some spurious measurements inevitable. In response to that, data distribution centers and coordinated observing programs have established clear quality control (QC) procedures, providing measurements with quality flags. Some of the widely used procedures were defined by: the Global Temperature and Salinity Profile Programme (GTSPP) (UNESCO-IOC, 2010), the European Global Ocean Observing System (EuroGOOS) of the Intergovernmental Oceanographic Commission of UNESCO (IOC GOOS) (DATA-MEQ working group 2010), the Argo Program (Wong et al. 2015), the CSIRO XBT Program (Bailey et al. 1994), the Australian Integrated Marine Observing System, and the Integrated Ocean Observing System (QARTOD group 2016). While each type of sensor requires some specific steps in the QC procedure, the recommendations mentioned above have several tests in common and share the same general structure of a sequence of independent tests. A major weakness of this approach is the lack of context awareness to qualify the data (Smith et al. 2012), often compensated by complementary manual QC. Manual evaluation by experts is indeed the best current option to minimize errors, mostly due to the relatively limited amount of measurements in the oceans and the adaptive skills of the human brain in pattern recognition. However, the efficiency of an automatic QC better suits real-time operations, such as weather forecasting, as well as the processing of large datasets. Also, inconsistencies in manual QC can pose a problem when comparing or integrating multiple datasets (Morello et al. 2014). Thus, improvements in measuring the ocean must be followed by advances in oceanographic QC methods to reduce the burden on the human experts without compromising the classification skill, and to allow for efficient and coherent data aggregation.

Developments in data mining and machine learning have been revolutionizing data analysis (see Ivezic et al. (2014) for a nice introduction to this subject). Such modern techniques provide a convenient framework to improve automatic QC by using supervised learning, which reduces the discrepancy with the human expert evaluation. From a machine learning perspective, oceanographic data QC is a classification task where the simplest setup is composed of only two categories: good or bad data. The most powerful technique used so far is Bayesian Networks (Smith et al. 2012), which combine different factors through an empirical network of relations to perform a quality classification by statistical inference. Such intricate decision making comes with a price: a slow learning rate that must overcome the natural variability. Therefore, calibrating Bayesian Networks requires a larger volume of data and/or a higher sampling rate (Bettencourt et al. 2007). Coastal monitoring sensors with fixed positions may satisfy that requirement, allowing for a skillful QC classification (Smith et al. 2012; Rahman et al. 2013, 2014), so that future improvements for similar scenarios shall come from tuning the Bayesian Network rather than employing a completely different technique. That is not necessarily true for other types of sensors with different sampling strategies.

Because a proper sampling procedure should minimize spurious measurements, the QC of hydrographic data is an unbalanced classification problem. The number of good measurements is typically at least two orders of magnitude larger than the available bad samples, which critically compromises the calibration of most machine learning techniques, including Bayesian Networks, Support Vector Machines, and Neural Networks. Rahman et al. (2014) brought an important contribution by proposing to use cluster undersampling to circumvent the unbalancing issue; however, that might not fully address the problem. The data used in the calibration phase must still statistically represent each one of the categories being classified, but cluster undersampling cannot guarantee that. Spurious data have no bounds of feasibility, being, therefore, relatively more variable than valid data, yet representing the smallest fraction of the measurements. If the available sample of bad data does not statistically represent all possible spurious measurements, the calibration would minimize the classification error for that particular dataset, but it would not necessarily be a good predictor for new bad samples, an issue known as overfitting (Ivezic et al. 2014). Even though cluster undersampling tackles the unbalancing issue, calibrating a Bayesian Network still requires a sufficient amount of bad samples, which is unusual for oceanographic profiles, which are sparse in space and time.

A machine learning technique more adequate to QC oceanographic data is anomaly detection, which learns the behavior of the good data and identifies the bad data as outliers. Therefore, neither the relative nor the absolute sample size of the bad data compromises the QC classification, which makes it possible to identify spurious measurements even if that kind of error has never been observed before. Previous implementations of anomaly detection for environmental measurements were based on the comparison of multiple sensors (Bettencourt et al. 2007) or auto-regressive models (Hill and Minsker 2010), which would not be the most appropriate for oceanographic profiles. In a novel approach, I propose to search for outliers on the characteristics of the data instead of the measured value itself. This is somewhat aligned with the vision of Gronell and Wijffels (2008), but with the advantages of the anomaly detection framework.

Although the observing programs provide data with quality flags, every observational scientific study must quality control its own freshly collected data for its own use, even before providing it to the data centers. Non-operational studies can rarely afford a QC specialist in their team, creating an overhead and risking the quality of their data products and scientific conclusions. This manuscript introduces an open source Python package, named CoTeDe, that provides in a single tool the different state-of-the-art possibilities to quality control oceanographic data. CoTeDe is optimized to serve data centers with large volumes of data, while flexible enough to accommodate the diverse combinations of procedures and fine tuning required by specific studies. Such flexibility makes it possible to provide the traditional QC approach, as well as modern powerful solutions like anomaly detection and fuzzy logic. The performance of the different techniques is compared in a case study of CoTeDe on a real hydrographic dataset, described in Section 2.1. The traditional QC procedures are briefly reviewed in Section 2.2, the anomaly detection is introduced in Section 2.3, and a fuzzy logic approach based on Morello et al. (2014) is presented in Section 2.4. The technical details on running CoTeDe are left for the user manual¹.

2. Methodology

All techniques and QC tests discussed in this study were implemented as a collection of independent modules in the open source package CoTeDe, which makes it easier to extend with new tests and gives the user flexibility to apply them. The user can customize any desired set of tests, including the specific parameters and thresholds of each test. Otherwise, there are preset QC procedures conforming to the recommendations from GTSPP (UNESCO-IOC, 2010), EuroGOOS (DATA-MEQ working group 2010), or ARGO (Wong et al. 2014). In addition, it also implements innovative approaches based on Fuzzy Logic (Timms et al. 2011; Morello et al. 2014) and Anomaly Detection. The different approaches to QC using CoTeDe are illustrated using the dataset described below.

2.1. Data

The data used to illustrate and discuss CoTeDe is the historical hydrographic CTD² dataset from the Prediction and Research Moored Array in the Atlantic (PIRATA) (Servain et al. 1998). It is composed of 194 CTD profiles, sampled between 1998 and 2011, with over 380,000 measurements of pressure, temperature, and salinity; Figure 1 illustrates one of these profiles. The positions of the stations vary along the years, all being near the western PIRATA buoys in the western Tropical Atlantic, between 15°N 38°W and 19°S 34°W. This dataset is provided by the Brazilian Navy - Banco Nacional de Dados Oceanográficos³ (BNDO).

Figure 1: Vertical profiles of temperature (green) and salinity (orange) approximately at 4°N 38°W on 2008/04/17, from the Brazilian PIRATA hydrographic database. Only the data approved by the quality control procedure recommended by EuroGOOS are shown. The sharp change of salinity around 950 dbar is due to sampling errors, but was missed by that quality control.

1 http://cotede.castelao.net
2 An electronic sensor that measures conductivity, temperature, and pressure.
3 https://www.mar.mil.br/dhn/chm/chm_new/bndo.htm
4 http://seabird.castelao.net

The first task in quality control is to properly extract all the data and metadata available in the CTD output files. That represents a challenge when using historical data because of the diversity of formats, even from the same manufacturer. The solution adopted was to create a standalone package, named Seabird⁴, to normalize all the data into one common easy-to-use format. Seabird is an open source Python package, developed with the goal of processing the outputs from SBE CTDs and thermosalinographs (TSG). For each data file, a regular expression that matches the content is used to parse the data. In the case of a new format, the existing regular expressions can be adjusted or a completely new one created, but the common engine is preserved. With this structure it becomes trivial to extend CoTeDe to a new type of dataset, only requiring a package to parse the raw data into the expected data object. Initially developed for CTD, CoTeDe has already been extended to TSG and ARGO data.
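The idea of parsing each file with a regular expression that matches its content can be sketched as follows. This is only an illustration under stated assumptions: the file layout in `SAMPLE`, the patterns `NAME_RE` and `BODY_RE`, and the function `parse` are hypothetical and are not taken from the Seabird package.

```python
import re

# Hypothetical, simplified text layout: header lines declare the variables,
# an end-of-header marker follows, then whitespace-separated data columns.
SAMPLE = """\
* Hypothetical CTD output
# name 0 = pressure
# name 1 = temperature
# name 2 = salinity
*END*
   1.000    27.5010    36.1020
   2.000    27.4980    36.1030
"""

# One rule per format; supporting a new format would mean adjusting these.
NAME_RE = re.compile(r"^# name (?P<index>\d+) = (?P<name>\w+)\s*$", re.MULTILINE)
BODY_RE = re.compile(r"^\*END\*\s*\n(?P<data>.*)\Z", re.MULTILINE | re.DOTALL)


def parse(text):
    """Return a dict mapping each variable name to its list of values."""
    names = [m.group("name") for m in NAME_RE.finditer(text)]
    body = BODY_RE.search(text)
    if body is None or not names:
        raise ValueError("content does not match this format rule")
    rows = [line.split() for line in body.group("data").splitlines() if line.strip()]
    return {name: [float(row[i]) for row in rows] for i, name in enumerate(names)}


profile = parse(SAMPLE)
```

Keeping the format-specific patterns separate from the generic `parse` logic mirrors the design described above, where the rules change but the engine is preserved.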

2.2. Traditional Quality Control 

Oceanographic data have been traditionally quality controlled using a collection of independent tests, each one seeking a known signature of bad measurements. These tests can be grouped into two types. The first checks for missing or invalid information, for example, whether the measurement is associated with a valid date and valid geographic coordinates. The second compares a characteristic of the data against a previously defined window of acceptable values. The most intuitive test in this group is the global range, whose thresholds delimit the feasible values in the oceans of the property being considered, for example the temperature itself. Even though this is a robust criterion, it does not cover spurious data within the range of feasible values. One solution is to project the original data onto dimensions that emphasize characteristics of bad measurements, with the goal of obtaining a new space where good and bad data spread apart from each other. Each projection is hereinafter referred to as a feature of that measurement.
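As a minimal sketch of the global range test described above, the function below flags values outside a window of feasible values. The function name, the limits used in the example, and the choice of flag 4 for failures and 9 for missing values are illustrative assumptions, not CoTeDe's actual implementation or any program's recommended thresholds.

```python
import numpy as np


def global_range(x, minval, maxval):
    """Flag 1 (good) inside [minval, maxval], 4 (bad) outside, 9 if missing."""
    x = np.asarray(x, dtype=float)
    flag = np.full(x.shape, 1, dtype=int)
    flag[(x < minval) | (x > maxval)] = 4
    flag[np.isnan(x)] = 9
    return flag


# Illustrative temperature limits in degrees Celsius (assumed values).
temperature = np.array([27.5, 12.3, 45.0, np.nan, -3.0])
flags = global_range(temperature, minval=-2.5, maxval=40.0)
```

Note that a value such as 12.3 passes regardless of where in the water column it was measured, which is exactly the limitation that motivates the features introduced next.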

In this manuscript I will use $\mathbf{x}$ for a set of measurements of a given variable $x$, which could be temperature for example, and $y_n$ for the $n$-th feature of $\mathbf{x}$, so that $y_n = y_n(\mathbf{x})$. The index $i$ refers to one specific measurement, like $x_i$, thus $i-1$ ($i+1$) is the previous (following) measurement in the data series. Given that, some examples of widely used features are:

Rate of Change: Evaluates the change from the previous measurement as:

$$ y_r = x_i - x_{i-1}. \qquad (1) $$

Gradient: Evaluates the rate of change surrounding the measurement, defined as:

$$ y_g = \left| x_i - \frac{x_{i+1} + x_{i-1}}{2} \right|. \qquad (2) $$

Spike: Evaluates how contrasting a measurement is in comparison with the adjacent successive measurements, defined as:

$$ y_s = y_g - \left| \frac{x_{i+1} - x_{i-1}}{2} \right|, \qquad (3) $$

where $y_g$ is the gradient given by Equation 2.

Tukey 53H: Evaluates how contrasting a measurement is in comparison to the low frequency tendency of the data series. It takes advantage of the robustness of the median to create a smoother data series which is used as reference (Goring and Nikora, 2002), with the following procedure:

1. $x^{(1)}_i$ is the median of the five points from $x_{i-2}$ to $x_{i+2}$;

2. $x^{(2)}_i$ is the median of the three points from $x^{(1)}_{i-1}$ to $x^{(1)}_{i+1}$;

3. $x^{(3)}_i$ is defined by the Hanning smoothing filter: $\frac{1}{4}\left( x^{(2)}_{i-1} + 2 x^{(2)}_{i} + x^{(2)}_{i+1} \right)$;

4. Finally:

$$ y_t = \frac{\left| x_i - x^{(3)}_i \right|}{\sigma}, \qquad (4) $$

where $\sigma$ is the standard deviation of the lowpass filtered data.

Climatology: Evaluates the bias between the observed measurement and a climatology, normalized by the expected variability at that point and time, using the relation:

$$ y_c = \frac{\left| x_i - \langle x \rangle \right|}{\sigma}, \qquad (5) $$

where $\langle x \rangle$ is the climatology, and $\sigma$ is the standard deviation of the observations used to create the climatology. Commonly used climatologies are the World Ocean Atlas (WOA) (Locarnini et al. 2010; Antonov et al. 2010) and the CSIRO Atlas of Regional Seas (CARS) (Ridgway et al. 2002).
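The features listed above can be sketched with NumPy as follows. This is an illustrative reading of Equations 1 to 5, not CoTeDe's code: the function names are hypothetical, points without enough neighbors are returned as NaN (an assumption about edge handling), and the absolute values follow the convention that a feature measures a magnitude.

```python
import numpy as np


def rate_of_change(x):
    """Eq. 1: difference from the previous measurement."""
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    y[1:] = x[1:] - x[:-1]
    return y


def gradient(x):
    """Eq. 2: departure from the mean of the two neighbors."""
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    y[1:-1] = np.abs(x[1:-1] - (x[2:] + x[:-2]) / 2.0)
    return y


def spike(x):
    """Eq. 3: gradient discounted by half the neighbors' difference."""
    x = np.asarray(x, dtype=float)
    y = gradient(x)
    y[1:-1] -= np.abs((x[2:] - x[:-2]) / 2.0)
    return y


def tukey53h(x):
    """Eq. 4: departure from a median-then-Hanning smoothed series."""
    x = np.asarray(x, dtype=float)
    n = x.size  # needs at least 9 points to produce any value
    x1 = np.full(n, np.nan)
    for i in range(2, n - 2):  # median of five points
        x1[i] = np.median(x[i - 2:i + 3])
    x2 = np.full(n, np.nan)
    for i in range(3, n - 3):  # median of three points
        x2[i] = np.median(x1[i - 1:i + 2])
    x3 = np.full(n, np.nan)
    x3[4:-4] = 0.25 * (x2[3:-5] + 2 * x2[4:-4] + x2[5:-3])  # Hanning filter
    sigma = np.nanstd(x3)  # std of the lowpass filtered series
    return np.abs(x - x3) / sigma


def climatology(x, clim_mean, clim_std):
    """Eq. 5: bias from the climatology in units of its std."""
    return np.abs(np.asarray(x, dtype=float) - clim_mean) / clim_std


# A linear ramp with one artificial spike at index 5.
x = np.arange(11, dtype=float)
x[5] = 15.0
```

In this example the spike at index 5 stands out in every feature, while the measured value itself (15.0) could easily fall inside a feasible global range.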


Figure 2 illustrates the traditional QC approach with the gradient test, applying a threshold limit to the feature gradient. Data flagged by the global range test were already removed, leaving some spurious measurements near the depth of 1000 dbar. The feature gradient projects the bad data onto a scale distinct from that of the regular temperature observations (see Fig. 2B). The variability near 1000 dbar suggests that the threshold used missed some bad data, i.e. some false positive flagging, a subject explored in the following sections.


Figure 2: (A) Temperature profile of station #10 from the PIRATA-X cruise; the green line is the data approved by the gradient test, while the orange triangles failed. (B) The same data plotted with respect to the gradient (Eq. 2) show a distinct scale for the bad values. The gray area delimits the threshold according to GTSPP.


CoTeDe does not modify or remove any measurement, but returns an overall quality flag for each input value according to the scale recommended by the Intergovernmental Oceanographic Commission of UNESCO (Table 1), a widely adopted flag standard (UNESCO-IOC, 2010; DATA-MEQ working group, 2010; SeaDataNet, 2010; Wong et al. 2014). The final flag of each measurement is the maximum flag value obtained among all performed tests, i.e., it is only considered good (flag 1) if approved by all tests. CoTeDe's manual⁵ provides a full list of implemented tests together with the parameters and thresholds recommended by different groups.
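The rule that the final flag is the maximum among all performed tests can be sketched as below. The test names and flag values are made up for illustration, and the sketch deliberately ignores how flags 0 and 9 should be handled, which the text does not specify.

```python
import numpy as np

# Hypothetical per-test flags for four measurements (1 good, 3 probably bad, 4 bad).
flags = {
    "global_range": np.array([1, 1, 1, 4]),
    "gradient": np.array([1, 3, 1, 1]),
    "spike": np.array([1, 1, 1, 1]),
}

# A measurement is good (1) only if every test approved it.
overall = np.max(np.stack(list(flags.values())), axis=0)
```

Because the maximum is taken, a single failed test is enough to downgrade a measurement, which is why each individual test tends to be calibrated conservatively.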

2.3. Anomaly Detection

Anomaly detection is a classification technique to discriminate commonly observed data from anomalies. While other classification methods try to describe each one of the classes being considered, the anomaly detection approach focuses on recognizing the common data. By assuming that spurious measurements are anomalous responses of the sensors, it provides a solution to quality control with less sensitivity to the bad data sample size. Further, by avoiding specific patterns to recognize bad data, the anomaly detection promptly identifies unprecedented measuring failures, while other techniques would need to explicitly learn the new pattern first.

5 http://cotede.castelao.net

Table 1: Quality control flags recommended by IOC-UNESCO, and adopted in CoTeDe.

Flag  Meaning
0     No QC was performed
1     Good data
2     Probably good data
3     Probably bad data
4     Bad data
6     Below detection limit
9     Missing data


This technique is not new to environmental measurements. Bettencourt et al. (2007) applied anomaly detection to a network of inland synchronous sensors, identifying bad samples even when they had feasible magnitudes. To achieve that, those authors compared each measurement with previous measurements as well as with neighbouring sensors, and detected the anomalies using a p-test. Evaluating the data by the magnitude itself requires a stationary timeseries or a sampling rate sufficiently high to overcome the changing trend of the environment (Bettencourt et al. 2007). That is an issue for oceanographers, since few marine datasets would meet those requirements, and, to aggravate that, duplicate measurements in the deep ocean are restricted to modern CTD casts. From a different argument, Hill and Minsker (2010) proposed an equivalent concept for the case of individual sensors, using sequential measurements in auto-regressive models, including a perceptron type of artificial neural network, to predict the following value in the timeseries. These authors used a pre-defined limit of confidence on the prediction to obtain the range of tolerance, so a value outside that range would be an outlier and assumed to be a bad measurement. This solution can handle small gaps as long as the sampling rate is sufficiently high, but it also requires a stationary timeseries or a regular update of the model parameters. Thus, the methodologies used so far in environmental systems are not adequate to quality control oceanographic observations, especially deep ocean measurements.

The alternative that I propose for using anomaly detection in oceanographic data is to project the original variable onto dimensions that emphasize different characteristics of the measurement, and then evaluate how anomalous those projections are instead of evaluating the measurement itself. The features used in the traditional QC (Section 2.2) suit that task well since they were designed to explore known characteristics of bad data. However, instead of being tested against fixed thresholds, they are used to characterize the measurement in another scale, for example a gradient intensity, i.e. the output of Equation 2 (see Fig. 2B). Each feature adds a new perspective of the measurement to a multi-dimensional criterion, allowing for a more flexible non-linear classification. The full procedure implemented in CoTeDe is explained in detail as follows.

The first task is to characterize the typical behavior of the data by estimating a probability density function (PDF) for each feature ($y_n$). Since the goal is to identify anomalies, and the bad measurements are usually much less than 1% of the data, any value below the 90th percentile is considered common, thus lacking evidence of being a bad measurement. The PDF is hence estimated using only the top 10% values of $y_n$, allowing a better fit in the range of interest. The best results that simultaneously satisfied the different features were obtained from the exponentiated Weibull continuous function, defined as

$$ \mathrm{PDF}(y \mid k, \lambda, \alpha) = \alpha \frac{k}{\lambda} \left( \frac{y}{\lambda} \right)^{k-1} \left[ 1 - e^{-(y/\lambda)^k} \right]^{\alpha - 1} e^{-(y/\lambda)^k}, \qquad (6) $$

where $k$, $\lambda$, and $\alpha$ are the adjustment parameters, and $y$ is a feature of the variable to be classified, for example the gradient (Eq. 2) of measured temperatures: $y_g = y_g(T)$. The respective survival function (SF) of the estimated PDF is used to quantify how anomalous a certain measurement is. For a feature $y$, $\mathrm{SF}(y_i)$ gives approximately how frequently a valid measurement $x$ was observed with $y > y_i$. Hence the higher $y_i$, the smaller $\mathrm{SF}(y_i)$, and the more anomalous $x_i$ is from the perspective of the feature $y$. Figure 3 illustrates the top 10% of gradient and climatology of temperature from the PIRATA dataset. Only 10% of the observations had a gradient over 0.013; therefore, values equal to or below that suggest a regular good sample, i.e. such a gradient gives no indication of bad data. In another case, $\mathrm{SF}_g(y_g = 0.1) = 0.077$, hence there is a 7.7% chance of obtaining a valid sample $x$ with $y_g > 0.1$.

Figure 3: (A) Distribution of the top 10% gradient test results (green), and the respective survival function (orange). (B) Distribution of the top 10% climatology comparison test (green), and the respective survival function. Only the data approved by the EuroGOOS QC procedure is considered.
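The fit of the upper tail of a feature and its survival function can be sketched with `scipy.stats.exponweib`, whose shape parameters `a` and `c` play the roles of $\alpha$ and $k$, and whose `scale` plays the role of $\lambda$. The details below are assumptions made for illustration only: the feature is synthetic, the fit is applied to the excess over the 90th percentile with the location fixed at zero, and values below that percentile are assigned a survival of 1. The text does not state how CoTeDe handles these choices.

```python
import numpy as np
from scipy.stats import exponweib

# Synthetic stand-in for a feature such as the gradient (not real data).
rng = np.random.default_rng(42)
y = np.abs(rng.normal(scale=0.01, size=5000))

# Keep only the top 10% and fit the excess over the 90th percentile.
q90 = np.percentile(y, 90)
excess = y[y > q90] - q90
a, c, loc, scale = exponweib.fit(excess, floc=0)  # a ~ alpha, c ~ k, scale ~ lambda


def survival(value):
    """Fraction of fitted good data expected above `value` (1 below q90)."""
    value = np.asarray(value, dtype=float)
    return exponweib.sf(np.clip(value - q90, 0.0, None), a, c, loc=loc, scale=scale)


sf = survival(np.array([0.0, 1.5 * q90, 3.0 * q90]))
```

The survival function decreases monotonically, so the larger the feature value, the more anomalous the measurement is considered from the perspective of that feature.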

Assuming independent features, the probability of observing a good measurement ($x_i$), characterized by a set of features $\{y_g(x_i), y_c(x_i), y_s(x_i)\}$, or a rarer scenario, is given by the product of the individual probabilities, i.e.

$$ P = \prod_n \mathrm{SF}_n(y_n), \qquad (7) $$

where $y_n$ is the $n$-th feature of the measurement $x_i$.
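Equation 7 amounts to an element-wise product of the survival values of each feature. The sketch below uses made-up survival values and a hypothetical helper name; it only illustrates the arithmetic under the independence assumption stated above.

```python
import numpy as np


def combined_probability(sf_by_feature):
    """Product over features of the survival values of each measurement."""
    stacked = np.stack([np.asarray(v, dtype=float) for v in sf_by_feature.values()])
    return np.prod(stacked, axis=0)


# Hypothetical survival values for two measurements.
sf_by_feature = {
    "gradient": [1.0, 0.077],
    "climatology": [1.0, 0.5],
    "spike": [1.0, 0.2],
}
P = combined_probability(sf_by_feature)
```

The second measurement is only moderately unusual in each feature taken alone, yet the product (0.077 x 0.5 x 0.2 = 0.0077) makes it clearly rarer, which is how the combination gains context compared with independent thresholds.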

Finally, it is necessary to define a probability threshold ($p$) in order to distinguish expected good data from an anomaly. To obtain that, the procedure recommended by EuroGOOS was taken as a good first guess. The data approved by the EuroGOOS procedure were randomly split into 3 subgroups: fit, test, and error estimate groups, with 60%, 20%, and 20% of the valid observations, respectively. The non-approved data were randomly split in half, with one half included in the test group and the other in the error estimate group. The PDF coefficients were adjusted based only on the fit group, which is expected to be mostly, if not fully, composed of actual good data, because only a tiny fraction of false positives is expected among the data approved by the EuroGOOS procedure. Therefore, the survival functions indicate how commonly a given result is observed among the good data. The threshold $p$ was defined to minimize the sum of false positive and false negative cases in the test group. The error of the anomaly detection approach was estimated by applying the $p$ from the previous step to the last data subgroup, the error estimate group. Since the data in the error estimate group are not used in the adjusting procedures, this is an unbiased error estimate. In summary, if the probability $P$ (Eq. 7) is greater than the threshold $p$, $x_i$ is flagged as good (1), otherwise it is flagged as probably bad (3).
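The choice of the threshold can be sketched as a search over candidate values that minimizes the number of misclassifications in the test group. The probabilities, reference labels, and candidate thresholds below are invented for illustration, and `choose_threshold` is a hypothetical helper; following the convention of this manuscript, a false positive is bad data approved as good and a false negative is good data rejected.

```python
import numpy as np


def choose_threshold(prob, is_good, candidates):
    """Return the candidate p with the fewest false positives plus false negatives."""
    errors = []
    for p in candidates:
        approved = prob > p
        false_pos = np.sum(approved & ~is_good)  # bad data approved
        false_neg = np.sum(~approved & is_good)  # good data rejected
        errors.append(false_pos + false_neg)
    return candidates[int(np.argmin(errors))]


# Hypothetical test group: combined probabilities and the reference labels.
prob = np.array([0.9, 0.8, 0.6, 0.3, 0.05, 0.01, 0.001])
is_good = np.array([True, True, True, True, True, False, False])
candidates = np.array([0.5, 0.1, 0.02, 0.005])

p = choose_threshold(prob, is_good, candidates)
flags = np.where(prob > p, 1, 3)  # 1 good, 3 probably bad
```

In practice the selected `p` would then be applied, unchanged, to the independent error estimate group to measure the performance.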


2.4. Fuzzy Logic

In contrast to the typical crisp threshold tests, which result in a binary quality evaluation, the fuzzy logic approach seeks a continuous quality scale, with a fuzzy transition between good and bad data. Each measurement is evaluated in a higher dimensional space by combining multiple features together, which allows a classification criterion with more degrees of freedom and a decision with better context awareness. One way to fine tune this technique is by adjusting the ranges associated with lower or higher uncertainty. That is an outstanding advantage (Morello et al. 2014) since the meaning of the adjusting parameters is not hidden in the math of the procedure, as in other techniques, thus the human expert can intuitively associate those parameters with the real world.

A series of manuscripts (Timms et al. 2011; Morello et al. 2011, 2014) proposed a fuzzy logic procedure to quality control hydrographic data, summarized in the following steps:

1. Each variable is evaluated by multiple features of the measurement. Morello et al. (2014) evaluated the water temperature using climatology, spike, and rate of change⁶, but more features can be added or exchanged while keeping the same general procedure.

2. Each feature is mapped into three fuzzy sets, i.e. three scales of uncertainty: low, medium, and high. The scaled feature is called the membership of the fuzzy set. For example, a temperature measurement identical to the climatology suggests high confidence; therefore, the feature climatology shall result in a membership equal to 1.0 for low uncertainty and memberships equal to 0.0 for medium and high uncertainty. Thus, each measurement results in three memberships times the number of features evaluated.

3. Fuzzy rules group the different fuzzy sets into combined memberships for each measurement. The high level is combined as the maximum value among all memberships for high uncertainty, while the low and medium levels are each combined by the mean of their respective memberships. While several factors are taken into account to consider some data as good, just one kind of error is sufficient to characterize bad data; hence, it is the maximum value that leads the decision for the high level of uncertainty. At this stage, each measurement is associated with 3 different levels of uncertainty, i.e. 3 values, independent of how many features were evaluated.

4. The traditional flag scale (Table 1) is obtained according to:

• Flagged as good (1) if the low uncertainty level is higher than 0.9;

• Flagged as probably good (2) if the low uncertainty level is higher than 0.5 and the high uncertainty level is lower than 0.3;

• Flagged as bad (4) if a threshold is crossed;

• Everything else is flagged as potentially correctable (3).

6 The rate of change as defined by Morello et al. (2014) is actually equivalent to the widely used gradient (Eq. 2), while Timms et al. (2011) define rate of change as presented in Eq. 1.
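The four steps above can be sketched as follows. Everything specific in this example is an assumption made for illustration: the piecewise-linear shape of the membership functions, the breakpoints, the hard limit used for flag 4, and the function names. Only the combination rules (mean for low and medium, maximum for high) and the flag criteria come from the steps described above.

```python
import numpy as np


def memberships(y, low_end, high_start):
    """Step 2: map a feature into low, medium and high uncertainty sets."""
    y = np.asarray(y, dtype=float)
    mid = 0.5 * (low_end + high_start)
    return {
        "low": np.interp(y, [low_end, mid], [1.0, 0.0]),
        "medium": np.interp(y, [low_end, mid, high_start], [0.0, 1.0, 0.0]),
        "high": np.interp(y, [mid, high_start], [0.0, 1.0]),
    }


def combine(per_feature):
    """Step 3: mean for low and medium, maximum for high uncertainty."""
    return {
        "low": np.mean([m["low"] for m in per_feature], axis=0),
        "medium": np.mean([m["medium"] for m in per_feature], axis=0),
        "high": np.max([m["high"] for m in per_feature], axis=0),
    }


def to_flag(level, hard_fail):
    """Step 4: translate the combined levels into the traditional flags."""
    flag = np.full(level["low"].shape, 3, dtype=int)  # everything else
    flag[(level["low"] > 0.5) & (level["high"] < 0.3)] = 2
    flag[level["low"] > 0.9] = 1
    flag[hard_fail] = 4  # a crisp threshold was crossed
    return flag


# Step 1: two hypothetical features evaluated for four measurements.
y_clim = np.array([1.0, 4.0, 5.5, 12.0])
y_spike = np.array([0.1, 0.1, 0.1, 0.1])

level = combine([memberships(y_clim, 3.0, 6.0), memberships(y_spike, 0.5, 2.0)])
flags = to_flag(level, hard_fail=y_clim > 10.0)
```

With these assumed breakpoints, the second measurement is accepted as probably good because a mild climatology departure is balanced by a clean spike feature, which a single crisp threshold could not express.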


Such a procedure does not use the medium uncertainty level to obtain the traditional flag scale, so it is actually based on only two levels of uncertainty, low and high. The bad data (flag 4) are identified as in the traditional QC; therefore, the effective improvement of this implementation is to add a continuous transition through probably good (flag 2) and probably bad (flag 3) data, giving more freedom to minimize false positives and false negatives.

CoTeDe provides the above-mentioned procedure, along with an alternative implementation of fuzzy logic that effectively uses the medium uncertainty set, allowing for a better resolution in the quality scale. Also, it defuzzifies by using the centroid of the combined memberships instead of step 4 described above. The final product is a quality level between 0 and 1 for each measurement. To better illustrate those procedures, the Supplementary Material contains some case studies, and for more details on the technique the reader is referred to CoTeDe's manual together with the original manuscripts (Timms et al. 2011; Morello et al. 2011, 2014).

3. Results and Discussion 

The quality control procedure for hydrographic data traditionally consists of a sequence of independent tests. Although there are different recommendations on which batch of tests to use, the general form is the same: each test checks a feature against a threshold for acceptable values. The outcome is hence highly sensitive to the threshold chosen, as a wide (strict) limit favors false positives (negatives). Figure 2 illustrates that dilemma, where the gradient test misses some spurious measurements near 1000 dbar in order to avoid misclassifying the intense gradient of the thermocline, near the surface. Because the gradient test classifies the data without any information other than the gradient, the calibration is limited to increasing or reducing the threshold value. Since failing one test is sufficient to flag the data as bad, the strategy usually adopted is to calibrate the tests to minimize false negatives, expecting that the false positives would be identified by another test. Therefore, the traditional QC performs well at flagging bad data, and improvements to reduce the burden on human expert QC should target false positives.

The detection of unfeasible values is a trivial procedure, so the real challenge is to identify bad measurements within the range of possible magnitudes. Thus, all results and considerations hereinafter are for the PIRATA dataset after discarding the failures on the global range test (Section 2.2), which removes 0.13% of the full dataset.

Figure 4 projects the temperature measurements with respect to
climatology and Tukey 53H. The observations are flagged as good
(green) or bad (orange) according to the EuroGOOS recommendations
for realtime data. Since that flagging does not include tests on
these two features, the classification is independent of the
projected dimensions. The gray rectangles delimit climatology over
6 and Tukey 53H over 1.5, illustrating the traditional QC procedure.
The Tukey 53H test agrees with the EuroGOOS classification,
capturing only bad data (gray box on Fig. 4A), thus it would be
redundant if aggregated into EuroGOOS, at least with such a
threshold. In contrast, the climatology test flags some data
otherwise classified as good (gray box on Fig. 4B). Considering
only the data already approved in all other tests from EuroGOOS
(green), a climatology test with a threshold of 6 would flag another
0.06% of the data as probably bad (flag 3), while a threshold of 3,
as recommended by GTSPP, increases that to 1.05%. That is higher
than what would be expected for normally distributed data. Manual
classification confirms that a threshold of 3 (6) results in more
than 1% (0.05%) of the dataset as false negatives, i.e. the
climatology test recurrently
flags uncommon real events as bad data. The climatology feature is
not symmetrically distributed around 0, as would be expected, but
has a median of 0.3, hence most of those failures were due to warm
anomalies. This result suggests that datasets quality controlled by
the widely used GTSPP standard would attenuate any long term trend,
such as global warming. The climatology test, as it is, assumes a
normally distributed stationary time series. If that assumption is
violated, the test systematically rejects good data, modifying the
spectrum of the final product. Another potential issue is regions
with insufficient historical observations to properly represent the
local variability. Nonetheless, since the climatology test
identifies bad data otherwise missed by other tests, CoTeDe uses
climatology in the anomaly detection procedure, but with two
modifications. First, the hard limit threshold is increased to 10
standard deviations instead of 3. Second, the standard error of the
climatology is discounted from the difference between the
climatological mean and the measurement. Regions with fewer
observations have a larger standard error, i.e. more uncertainty in
the climatology, therefore the comparison should be more tolerant to
differences.
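
The second modification can be sketched as below. This is only an
illustration under stated assumptions: the standard error is taken
as the climatological standard deviation divided by the square root
of the number of observations, and the discounted difference is
normalized by the standard deviation. The function name and its
arguments are hypothetical and are not CoTeDe's API.

```python
import math


def climatology_feature(value, clim_mean, clim_std, n_obs):
    """Signed distance from the climatology, in standard deviations.

    The standard error of the climatological mean is discounted from the
    absolute difference, so poorly sampled regions are judged more leniently.
    """
    std_error = clim_std / math.sqrt(n_obs)  # assumption: SE = std / sqrt(n)
    bias = value - clim_mean
    # Differences smaller than the standard error count as no anomaly.
    discounted = max(abs(bias) - std_error, 0.0)
    return math.copysign(discounted, bias) / clim_std


# Same 5 degC warm anomaly against a sparse and a dense climatology.
sparse = climatology_feature(25.0, 20.0, 1.0, n_obs=4)   # SE = 0.5
dense = climatology_feature(25.0, 20.0, 1.0, n_obs=100)  # SE = 0.1
```

With only 4 historical observations the anomaly is scored lower
than with 100, so the same measurement is less likely to be
rejected where the climatology itself is uncertain.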

While it is hard to define an optimal threshold for each
unidimensional projection, due to the superposition between good and
bad data (see A and B on Fig. 4), the bidimensional space shows a
clear polarization between the two classes (see Fig. 4). The black
dashed line is a better criterion than the gray rectangles to
classify the data, but such a slope is not possible when evaluating
the features one at a time. The traditional QC procedure is
equivalent to widening or shrinking the gray rectangles, while
always keeping the same shape. The upper edge of the good data
cluster (in green on Fig. 4) would be flagged as bad by the
traditional approach due to climatology above 6, but manual
classification identifies those as valid measurements from anomalous
natural events.
A major advantage of the expert QC comes from context awareness
(Smith et al. 2012), i.e. considering more information to evaluate
cases not obvious at first glance. Techniques like anomaly
detection, fuzzy logic and Bayesian Networks combine features into a
multidimensional criterion, achieving superior skill compared to
multiple unidimensional tests. The sparse bad data (orange dots,
Fig. 4) in the middle of the cluster of good data (green cloud,
Fig. 4), with small values of climatology and Tukey 53H, illustrate
how the projections can be orthogonal, thus reinforcing the need for
multiple features to identify spurious data. A multidimensional
space analysis allows for a criterion with more degrees of freedom
and, with a careful set of features, the good data is identified as
a distinct cluster.
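
The idea of a single multidimensional criterion can be sketched as
follows. This is a simplified illustration, not the calibrated
procedure of this work: it assumes that the features are
independent, uses an empirical survival function built from
synthetic good data, and adopts an arbitrary probability threshold
of 0.05. All names and values are hypothetical.

```python
import math
from bisect import bisect_left


def survival(sorted_good, value):
    """Empirical probability that a good measurement is >= value."""
    n = len(sorted_good)
    n_at_or_above = n - bisect_left(sorted_good, value)
    # Floor at half a count so log() is defined beyond the training range.
    return max(n_at_or_above, 0.5) / n


def log_prob_good(sample, good_features):
    """Joint log-probability, assuming independent features."""
    return sum(
        math.log(survival(good_features[name], abs(value)))
        for name, value in sample.items()
    )


# Synthetic training set of good data (illustrative values only).
good_features = {
    "tukey53H": [i / 100 for i in range(100)],    # 0.00 ... 0.99
    "climatology": [i / 20 for i in range(100)],  # 0.00 ... 4.95
}

# Both samples pass unidimensional thresholds of 1.5 (Tukey 53H) and 6
# (climatology), but the second is unusual in both features at once.
one_feature_high = {"tukey53H": 0.9, "climatology": 0.5}
both_features_high = {"tukey53H": 0.9, "climatology": 4.5}

p_one = math.exp(log_prob_good(one_feature_high, good_features))
p_both = math.exp(log_prob_good(both_features_high, good_features))
flagged = [p < 0.05 for p in (p_one, p_both)]
```

Because the probabilities multiply, the decision boundary is no
longer a rectangle aligned with the axes: a sample can be rejected
for being moderately unusual in several features, even when it
passes each unidimensional test.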

[Figure 4 plot omitted; recoverable axis label: Number of good observations]


Figure 4: Observations of the PIRATA-Brazil hydrography with respect
to Tukey 53H (A) and climatology (B). The good data are in green,
and the bad data in orange, according to the EuroGOOS recommendation
for realtime. The gray boxes delimit Tukey 53H and climatology above
1.5 and 6, respectively. The black dashed line is an approximate
threshold between the good and bad data clusters.

Figure 5 illustrates a temperature profile approved by the EuroGOOS
criteria. The zoom around 724 dbar shows a questionable abrupt
change in the profile. The features are not large enough to be
individually considered bad data by the traditional QC thresholds
(see Table 2), but the anomaly detection approach identifies this
measurement as a distinct structure in the profile (see Figure 5, in
orange). While traditional QC does a good job avoiding false
negatives, the anomaly detection technique complements it by
identifying false positives without requiring a large sample of bad
data for training, and without suffering from the imbalance in the
dataset.

[Figure 5 plot omitted; axis labels: Temperature [°C], Probability]


Figure 5: Temperature measurements from the cruise PIRATA-X, profile
10. The data approved by the EuroGOOS procedure are shown in green,
and the probability of being good data, according to the anomaly
detection, in orange. The small panel shows a zoom on the
temperature between 720 and 728 dbar.
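
As a worked check, the gradient and spike values reported in Table 2
for sample 601 can be reproduced from the three temperatures alone.
The formulas below are the usual GTSPP-style definitions; they are
an assumption here, since this section does not spell them out, but
they match the tabulated values.

```python
def gradient(v1, v2, v3):
    """Departure of the central value from the mean of its neighbors."""
    return abs(v2 - (v1 + v3) / 2)


def spike(v1, v2, v3):
    """Gradient minus the half-range of the neighbors (can be negative)."""
    return abs(v2 - (v1 + v3) / 2) - abs((v3 - v1) / 2)


# Temperatures [degC] of samples 600, 601 and 602 from Table 2.
t600, t601, t602 = 7.03, 5.67, 6.31

g601 = gradient(t600, t601, t602)
s601 = spike(t600, t601, t602)
print(f"gradient = {g601:.2f}, spike = {s601:.2f}")
# prints "gradient = 1.00, spike = 0.64"
```

Both values stay below the thresholds listed in Table 2, which is
why the unidimensional tests approve a sample that looks suspicious
in the profile.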

Table 2: Observed temperatures and respective quality control test
results for the samples 600 to 602 of the profile 10 of cruise
PIRATA-X. This interval is also shown in the zoom of Figure 5. The
last column shows the thresholds suggested by the EuroGOOS
procedures.

                   Sample 600   Sample 601   Sample 602   Threshold
Pressure [dbar]    723          724          725
Temp. [°C]         7.03         5.67         6.31
Gradient           0.54         1.00         0.05         3.0
Spike              -0.28        0.64         -0.64        2.0
Climatology        0.75         3.11         1.28         6
Tukey 53H          0.01         0.28         0.15         1.5
Anom. det.         2e-6         1e-20        9e-7

The full EuroGOOS procedure, which includes the climatology test,
can be reproduced by the anomaly detection, when calibrated for that
purpose, with a mistake rate of approximately 0.4%. Most of the
disagreements are due to a known bias for false positives of the
traditional QC techniques, as discussed earlier; therefore, that is
an overestimate of the errors from the anomaly detection. A better
reference is necessary to properly evaluate the performance of each
QC approach, which was obtained through active learning, as follows.
In a first iteration, the anomaly detection was calibrated and
evaluated assuming the EuroGOOS classification as the truth. The
severity of each supposedly misclassified measurement is quantified
by the difference between the probability threshold ($\hat{p}$) and
the estimated probability of being good data ($p$). The rationale is
that avoiding a misclassification with large $|\hat{p} - p|$
requires greater changes in the classification parameters that
ultimately defined $p$; hence that mistake is in greater
disagreement with the criteria used than one with a small
$|\hat{p} - p|$. That rank of misclassifications drives an iterative
process where the worst mistake is manually evaluated first and its
flag is confirmed or corrected; the reference is then updated, the
anomaly detection recalibrated, and the misclassification rank
redefined. The human effort is hence optimized by classifying the
most critical errors first, while the calibration converges. The
performance of each recalibration is evaluated using an independent
dataset, the error subset described in Section 2.3, and the
iteration ceases once the error on the error subset stabilizes or
increases, hence avoiding overfitting. Such active learning results
in a better reference classification without manually processing the
whole dataset.
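
The ranking step of that iteration can be sketched as below. It is
a minimal illustration assuming that each measurement has a
reference flag (good or not) and an estimated probability of being
good; the function name, the example values and the threshold of
0.05 are hypothetical.

```python
def rank_disagreements(reference_good, prob_good, threshold):
    """Indices where classifier and reference disagree, worst first.

    Severity is the absolute difference between the probability
    threshold and the estimated probability of being good data.
    """
    disagreements = []
    for i, (is_good, p) in enumerate(zip(reference_good, prob_good)):
        predicted_good = p >= threshold
        if predicted_good != is_good:
            disagreements.append((abs(threshold - p), i))
    disagreements.sort(reverse=True)
    return [i for _, i in disagreements]


# Reference flags (True = good) and estimated probabilities of being good.
reference_good = [True, True, False, True]
prob_good = [0.9, 1e-6, 0.8, 0.04]

to_review = rank_disagreements(reference_good, prob_good, threshold=0.05)
# to_review == [2, 1, 3]: the expert inspects measurement 2 first.
```

After the expert confirms or corrects the first flag in the list,
the classifier would be recalibrated and the ranking recomputed,
stopping when the error on the independent subset no longer
decreases.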

Table 3 shows the performance of the different QC procedures
available in CoTeDe. The GTSPP and EuroGOOS for realtime (without
climatology) achieved the lowest rate of false negatives, but also
had the highest rate of false positives. Aggregating the climatology
into those procedures reduces the rate of false positives, while
increasing the rate of false negatives. For the GTSPP that was
critical, resulting in the worst overall performance, misclassifying
1% of the dataset. The two fuzzy logic approaches had a surprisingly
high rate of false negatives, which could probably be improved by a
better calibration scheme. The anomaly detection achieved the best
overall performance, with 2 misclassifications per 10,000
measurements, thus halving the errors of EuroGOOS for realtime, the
former best procedure. It is worth noting that the error of the
anomaly detection was estimated from the error subset, hence
independently of the data used in the calibration.
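
The comparison above follows directly from the values in Table 3.
The short script below only redoes that arithmetic; the dictionary
keys are labels chosen here for readability.

```python
# Errors per 10,000 measurements, from Table 3: (false bad, false good).
errors = {
    "GTSPP (w/ climatology)": (97.4, 2.9),
    "GTSPP realtime": (0.0, 5.2),
    "EuroGOOS (w/ climatology)": (4.3, 2.8),
    "EuroGOOS realtime": (0.0, 4.1),
    "Morello 2014": (11.6, 2.7),
    "Fuzzy Logic": (12.0, 2.5),
    "Anomaly detection": (0.2, 1.8),
}

# Total error is the sum of both kinds of misclassification.
totals = {name: bad + good for name, (bad, good) in errors.items()}
best = min(totals, key=totals.get)

# Relative reduction with respect to the best traditional procedure.
reduction = 1 - totals[best] / totals["EuroGOOS realtime"]
print(f"{best}: {totals[best]:.1f} per 10,000 ({reduction:.0%} fewer errors)")
```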


Gronell and Wijffels (2008) also explored the idea of identifying
bad data by searching for outliers in multiple features, using an
equivalent of the climatology test from the traditional QC, but
applied to each feature instead of to the measurement itself. That
was a major improvement over the traditional QC since the local
variability scaled the test threshold, avoiding the use of one
constant for the whole ocean, which is known to be heterogeneous.
Despite the differences in the methodologies, it is easy to see some
conceptual equivalence with the technique that I propose. Among the
improvements, the anomaly detection does not assume a normal
distribution and drastically reduces the manual QC effort, but the
main advantage comes from the multidimensional criterion.

Table 3: Errors per 10,000 measurements for different QC approaches,
estimated on the PIRATA hydrography dataset approved on the global
range test, i.e. after removing the unfeasible values. For
reference, manual QC resulted in a ratio of 8.8 bad data per 10,000
measurements.

QC procedure           False bad   False good   Total error
GTSPP (w/ clim.)       97.4        2.9          100.3
GTSPP realtime         0.0         5.2          5.2
EuroGOOS (w/ clim.)    4.3         2.8          7.1
EuroGOOS realtime      0.0         4.1          4.1
Morello 2014           11.6        2.7          14.3
Fuzzy Logic            12.0        2.5          14.5
Anomaly detection      0.2         1.8          2.0

The QC methodology introduced here allows the inclusion of new
features, each one adding a new perspective on the data that can
help to identify sampling errors. For example, some specific cruises
analyzed here had a persistent lack of data in the first tens of
meters, near the surface, as well as a lowering speed faster than
recommended for CTDs. A human expert would note the improper
operating procedures, being more likely to flag data as bad on the
smallest indication. The anomaly detection could mimic that analysis
by aggregating two new features: the shallowest measurement in a
profile, and the descending rate. Morello et al. (2014) use, for
other purposes, the time since the last calibration, while Gronell
and Wijffels (2008) introduce several other features, which could
all be added in this example. In case any of those characteristics
is too far from the expected, it would contribute to the total
uncertainty probability (p), hence a smaller gradient or spike would
be sufficient to exceed the acceptable p. The anomaly detection as
proposed here is a quantitative way to accumulate different aspects
of the data for a non-linear classification, thus making decisions
with deeper context awareness.
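
The two candidate features mentioned above could be derived from a
profile as in the sketch below. It assumes that each profile
provides pressure in dbar and elapsed time in seconds; the function
names and the sample values are hypothetical.

```python
def shallowest_measurement(pressure):
    """Smallest pressure [dbar] sampled in the profile."""
    return min(pressure)


def descent_rates(pressure, elapsed_s):
    """Lowering speed [dbar/s] between consecutive samples."""
    rates = []
    for i in range(1, len(pressure)):
        dp = pressure[i] - pressure[i - 1]
        dt = elapsed_s[i] - elapsed_s[i - 1]
        rates.append(dp / dt)
    return rates


# A profile that starts deep and is lowered quickly (made-up values).
pressure = [32.0, 34.0, 36.5, 39.0]
elapsed_s = [0.0, 1.0, 2.0, 3.0]

first_sample = shallowest_measurement(pressure)
fastest = max(descent_rates(pressure, elapsed_s))
```

Each value would then enter the anomaly detection as one more
feature, so an unusually deep first sample or an unusually fast
descent lowers the estimated probability of the whole profile being
good.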


An intrinsic byproduct of the anomaly detection approach is to
define how uncommon a given scenario is. Yao et al. (2010) discuss
the importance of identifying, in realtime, anomalous but valid
measurements for management response, such as detecting an algae
bloom. This concept raises new possibilities for autonomous sampling
systems. An intelligent sensor running an onboard realtime quality
control could be set up to increase the sampling rate once a
threshold on the probability of occurrence is reached. That would
minimize the losses from bad samples, as well as increase the
sampling resolution of uncommon events. It would be a major
improvement in the effort to optimize observing systems. For
example, an underwater glider could stop in a place for one or two
dives, before resuming its pre-planned mission, once it detected
something different. An Argo float could anticipate its cycle and
redo a profile if the previous measurements were unexpected. It is
common to keep subsurface moorings over a year at sea without any
communication, so any interesting event would only be found after
recovering the equipment. An intelligent adjustable sampling rate
would increase the spectrum coverage of autonomous sensors, with the
same storage memory and power budget.

4. Concluding Remarks

Machine learning techniques provide fascinating approaches to
automatically classify data by employing supervised learning, which
is based on the principle of training the classification system with
some known answers. Such calibration usually requires sufficient
data to statistically represent each class, but the better the
measuring procedure, the smaller the amount of bad data. Oriented
undersampling can circumvent the contrast in relative sample sizes
(Rahman et al. 2014), but too few bad data is a serious limitation
for the usual approach of identifying each class, i.e. recognizing a
bad measurement in the same way that it recognizes a good
measurement. To aggravate that, observations in the open ocean are
typically sparse in space and time, allowing for few, if any,
overlapping measurements. Thus, most machine learning techniques,
such as Bayesian Networks and Support Vector Machines, although
powerful, are not the most adequate approach for open water
oceanographic data due to the intrinsic characteristics of such
datasets.

The novel approach based on the anomaly detection technique that I
propose impacts the QC of oceanographic data in two ways. First, it
optimizes the expert effort by driving the manual evaluation to the
most dubious measurements first, allowing the experts to efficiently
handle the increasing amount of measurements in the oceans. Second,
it combines multiple characteristics of each measurement for deeper
decision making, resulting in a higher context awareness for more
intricate classification. The same anomaly detection is not limited
to QC, but could also be used to guide self-adjusting sampling
platforms, increasing the spectrum coverage of the measurements.

The Python package CoTeDe is an open source 
platform to allow easy application of the current state 
of the art QC techniques on oceanographic measure¬ 
ments, with the possibility to customize the set of 
tests to be used. 


Software Availability 


Package name: CoTeDe
Program language: Python
Developer: Guilherme P. Castelao
Available since: 2013
Access: Open source
Website: http://cotede.castelao.net
Cost: Free software
License: 3-clause BSD